searching "activitypub" on mastodon will return irrelevant message results from flipboard.com #30292

Open
filippodb opened this issue May 14, 2024 · 2 comments
Labels: area/search, bug (Something isn't working), status/identified (This bug has been identified)

Comments

filippodb commented May 14, 2024

Steps to reproduce the problem

1. Search for "activitypub"

Expected behaviour

A full list of English-language articles about ActivityPub

Actual behaviour

Random, unrelated messages from flipboard.com

Detailed description

On mastodon.uno and mastodon.social, searching for:

activitypub language:EN

returns countless unrelated posts from flipboard.com.

Mastodon instance

Mastodon.social

Mastodon version

v4.3

Browser name and version

Brave

Operating system

Fedora Linux

Technical details

It also happens when searching in other languages:

activitypub language:IT

@filippodb filippodb added area/web interface Related to the Mastodon web interface bug Something isn't working status/to triage This issue needs to be triaged labels May 14, 2024
@filippodb filippodb changed the title search actiyvitipub returns irrelevant message results from flipboard.com searching "actiyvitipub" on mastodon will returns irrelevant message results from flipboard.com May 14, 2024
@filippodb filippodb changed the title searching "actiyvitipub" on mastodon will returns irrelevant message results from flipboard.com searching "actiyvitipub" on mastodon will return irrelevant message results from flipboard.com May 14, 2024
@renchap renchap changed the title searching "actiyvitipub" on mastodon will return irrelevant message results from flipboard.com searching "activitypub" on mastodon will return irrelevant message results from flipboard.com May 14, 2024
Sponsor Member

renchap commented May 14, 2024

Thanks for your report, I have been able to reproduce it. I suspect this comes from the link included in the message, as it contains "activitypub". Maybe our tokeniser when indexing into ES is splitting the URLs into separate tokens?

For example, one such link is https://techspot.com/news/102598-avast-free-antivirus-testing-features-learning-about-six.html?utm_source=flipboard&utm_medium=activitypub, and Flipboard adds these UTM parameters to every link.
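To make the suspicion concrete, here is a minimal Ruby sketch of what happens when a URL is split on word boundaries. The `rough_tokens` helper is hypothetical and only a rough stand-in: it is not Elasticsearch's actual standard tokenizer, which follows Unicode text segmentation rules and may keep some pieces (such as `techspot.com`) together. It is only meant to show that the value after `utm_medium=` ends up as a token of its own.

```ruby
# Hypothetical helper: a crude approximation of word-boundary tokenization.
# It lowercases the text and keeps runs of letters, digits, and underscores.
# This is NOT the real Elasticsearch standard tokenizer.
def rough_tokens(text)
  text.downcase.scan(/[[:alnum:]_]+/)
end

url = 'https://techspot.com/news/102598-avast-free-antivirus-testing-' \
      'features-learning-about-six.html?utm_source=flipboard&utm_medium=activitypub'

tokens = rough_tokens(url)

# The tracking parameter value becomes a searchable term of its own.
puts tokens.include?('activitypub')
```

If the indexed text includes the link, any post carrying this tracking parameter would match a search for "activitypub", regardless of what the post is about.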

@renchap renchap added status/identified This bug has been identified area/search and removed status/to triage This issue needs to be triaged area/web interface Related to the Mastodon web interface labels May 14, 2024
jasonculverhouse commented May 14, 2024

I think you are going to have to strip URLs from the plain text if you don't want them to be stemmed.

Note that this will also happen if you search for amp. You will end up with every result that has more than one query parameter, as the separators are encoded as &amp; in the text. The standard tokenizer is going to index all of these under amp.
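A small Ruby sketch of the amp case, under the same assumption as before: `rough_tokens` is a hypothetical stand-in for a word-boundary tokenizer, not the real Elasticsearch one, and the example URL is made up. It compares the tokens of a link whose query separator is HTML-escaped with the tokens of the decoded link.

```ruby
# Hypothetical helper: crude word-boundary tokenization, for illustration only.
def rough_tokens(text)
  text.downcase.scan(/[[:alnum:]_]+/)
end

# A link with two query parameters, as it would appear in HTML-escaped text.
escaped = 'https://example.com/page?a=1&amp;b=2'

# Minimal decoding of just this one entity, enough for the comparison.
decoded = escaped.gsub('&amp;', '&')

escaped_tokens = rough_tokens(escaped)
decoded_tokens = rough_tokens(decoded)

puts escaped_tokens.include?('amp')
puts decoded_tokens.include?('amp')
```

The escaped form contributes an `amp` token for every `&` separator, which is why a search for amp would match any post whose link has more than one query parameter.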

https://github.com/mastodon/mastodon/blob/3a7aec2807089a004db90851c66db0a007a18a48/app/chewy/statuses_index.rb/#L30-L41

I would think that one could remove the URLs from the searchable_text that is indexed in the :stemmed field.

    field(:text, type: 'text', analyzer: 'verbatim', value: ->(status) { status.searchable_text }) { field(:stemmed, type: 'text', analyzer: 'content') }
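As a rough idea of what removing the URLs could look like, here is a plain Ruby sketch. The `strip_urls` helper and its regular expression are assumptions made for illustration and are not part of Mastodon; how such a value would be wired into the index (a separate field, or a character filter on the analyzer) is left open here.

```ruby
# Hypothetical helper: drops http(s) URLs from a piece of text and
# collapses the whitespace left behind. The pattern is deliberately simple
# and treats everything up to the next whitespace as part of the URL.
def strip_urls(text)
  text.gsub(%r{https?://\S+}, ' ').squeeze(' ').strip
end

post = 'Avast is testing new features ' \
       'https://techspot.com/news/example.html?utm_source=flipboard&utm_medium=activitypub ' \
       'via Flipboard'

puts strip_urls(post)
```

With the link removed before analysis, the tracking parameters no longer contribute terms to the stemmed text, while the words of the post itself stay searchable.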

There is also the html_strip character filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html

Strips HTML elements from a text and replaces HTML entities with their decoded value (e.g, replaces &amp; with &).

Might help?
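For reference, a custom analyzer that applies this character filter could be declared roughly as below, written as a Ruby hash of Elasticsearch analysis settings. The analyzer name `content_decoded` is hypothetical, and this is only a sketch: it has not been checked against Mastodon's existing analyzers or shown to fix the results in this issue. It would only address the amp case, not the UTM parameters.

```ruby
require 'json'

# Sketch of Elasticsearch analysis settings using the built-in html_strip
# character filter ahead of the standard tokenizer.
analysis = {
  analyzer: {
    content_decoded: {              # hypothetical analyzer name
      type: 'custom',
      char_filter: %w[html_strip],  # decodes entities such as &amp; before tokenizing
      tokenizer: 'standard',
      filter: %w[lowercase]
    }
  }
}

# Print the settings as JSON, the shape Elasticsearch expects under "settings".
puts JSON.pretty_generate(analysis: analysis)
```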
